Game of Thrones Analysis¶
Game of Thrones is an American medieval fantasy drama series created by David Benioff and D. B. Weiss and brought to screen by HBO. The series is based on the popular novel series A Song of Ice and Fire, written by George R. R. Martin. It is regarded as one of the most popular fantasy television shows in history, cementing its place in pop culture with its complex characters, detailed and expansive world-building, and immersive storytelling.
This series quickly became one of the most-watched shows in history, drawing in millions of viewers per episode, and eventually delivered one of the most controversial endings for a show of such massive viewership and popularity. Yikes.
The purpose of this project is to collect, clean, visualize, and analyze data from this television series. I will analyze viewership trends and conduct statistical analysis using data obtained through web scraping.
Packages¶
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
from os import path
from wordcloud import WordCloud, STOPWORDS
import scipy.stats
from scipy.stats import f_oneway
Webscraping¶
Below, I have scraped two tables from the Game of Thrones Wikipedia episode list page: https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes
1.1 Series Overview: provides total # of episodes per season, season start and end dates and average viewership.
1.2 Series Episode Table: provides episode titles, directors, writers, air dates and viewership.
1.1 Series Overview¶
This dataset summarizes each of the 8 seasons of the show and provides a general overview.
- Season: The numbered season (1–8).
- Episodes: The total number of episodes in that season.
- First Released: The original premiere date for that season.
- Last Released: The date the final episode of the season aired.
- Avg_US_Viewers: The average number of U.S. viewers per episode (in millions).
url = "https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes"
headers = {
"User-Agent": "InClassDemoBot/1.0 (contact: smirwin1@asu.edu)"
}
page1 = requests.get(url, headers=headers)
# Parse the response fetched above (page1, not the later `page` variable)
soup1 = BeautifulSoup(page1.content, 'html.parser')
body1 = soup1.find(id="mw-content-text")
overview_table = body1.find("table", {"class": "wikitable plainrowheaders"})
overview_rows = []
for row in overview_table.find_all("tr"):
    cells = [c.get_text(" ", strip=True) for c in row.find_all(['th','td'])]
    # Skip empty rows
    if not cells:
        continue
    # Keep only rows whose first cell is a season number
    if not cells[0].strip().isdigit():
        continue
    overview_rows.append(cells)
# Converts list of rows into a dataframe with column names
overview_df = pd.DataFrame(overview_rows, columns=[
"Season", "Episodes", "First_Released", "Last_Released", "Avg_US_Viewers"
])
# Removes wikipedia reference markers
overview_df["Avg_US_Viewers"] = (
overview_df["Avg_US_Viewers"]
.str.replace(r'\[.*?\]', '', regex=True)
.str.replace(',', '')
.str.strip()
)
overview_df["Avg_US_Viewers"] = pd.to_numeric(overview_df["Avg_US_Viewers"], errors='coerce')
overview_df["First_Released"] = overview_df["First_Released"].str.replace(r'\(.*?\)', '', regex=True).str.strip()
overview_df["Last_Released"] = overview_df["Last_Released"].str.replace(r'\(.*?\)', '', regex=True).str.strip()
overview_df
| | Season | Episodes | First_Released | Last_Released | Avg_US_Viewers |
|---|---|---|---|---|---|
| 0 | 1 | 10 | April 17, 2011 | June 19, 2011 | 2.52 |
| 1 | 2 | 10 | April 1, 2012 | June 3, 2012 | 3.80 |
| 2 | 3 | 10 | March 31, 2013 | June 9, 2013 | 4.97 |
| 3 | 4 | 10 | April 6, 2014 | June 15, 2014 | 6.84 |
| 4 | 5 | 10 | April 12, 2015 | June 14, 2015 | 6.88 |
| 5 | 6 | 10 | April 24, 2016 | June 26, 2016 | 7.69 |
| 6 | 7 | 7 | July 16, 2017 | August 27, 2017 | 10.26 |
| 7 | 8 | 6 | April 14, 2019 | May 19, 2019 | 11.99 |
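The loop above keeps only rows whose first cell is a season number. A minimal sketch of that filter on cell lists that have already been extracted; the header row is a hypothetical stand-in (not copied from the page), while the two data rows mirror the output above:

```python
# Cell lists as they might look after get_text(); the header row is a
# hypothetical stand-in, the data rows mirror the overview table above.
sample_rows = [
    ["Season", "Episodes", "First released", "Last released", "Avg. viewers"],
    [],
    ["1", "10", "April 17, 2011", "June 19, 2011", "2.52"],
    ["8", "6", "April 14, 2019", "May 19, 2019", "11.99"],
]

# Keep a row only if it is non-empty and its first cell is purely digits.
season_rows = [cells for cells in sample_rows if cells and cells[0].strip().isdigit()]

print(len(season_rows))
```

Header rows and empty rows fall out because their first cell is missing or is not a number.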
1.2 Series Episode Table¶
This dataset comes from the season-episode tables I scraped from the Wiki page. It contains detailed information for all 73 episodes of the show, including:
- Episode number (overall and within the season)
- Episode title
- Director(s)
- Writer(s)
- Air date
- U.S. viewership (in millions)
This code scrapes the data by locating the season-episode tables with the class "wikitable plainrowheaders wikiepisodetable" and extracts all rows stored in a <tr> tag with the class "vevent". For each episode, the script collects the text from table cells <th> or <td> and forms a list of episode attributes. All the episodes are appended to episodeList. This raw data is then converted into a dataframe.
url = "https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes"
headers = {
"User-Agent": "InClassDemoBot/1.0 (contact: smirwin1@asu.edu)"
}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
body = soup.find(id="mw-content-text")
seasonTables = body.find_all("table", {"class": "wikitable plainrowheaders wikiepisodetable"})
# Grabs information from episode tables
episodeList = []
for season_table in seasonTables:
    # find_all searches all descendants, so each season table can be queried directly
    for episode_row in season_table.find_all("tr", {"class": "vevent"}):
        cells = [c.get_text(" ", strip=True) for c in episode_row.find_all(['th', 'td'])]
        episodeList.append(cells)
got_df = pd.DataFrame(episodeList)[0:73] # removes specials from episode list
got_df
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | " Winter Is Coming " | Tim Van Patten | David Benioff & D. B. Weiss | April 17, 2011 ( 2011-04-17 ) | 2.22 [ 21 ] |
| 1 | 2 | 2 | " The Kingsroad " | Tim Van Patten | David Benioff & D. B. Weiss | April 24, 2011 ( 2011-04-24 ) | 2.20 [ 22 ] |
| 2 | 3 | 3 | " Lord Snow " | Brian Kirk | David Benioff & D. B. Weiss | May 1, 2011 ( 2011-05-01 ) | 2.44 [ 23 ] |
| 3 | 4 | 4 | " Cripples, Bastards, and Broken Things " | Brian Kirk | Bryan Cogman | May 8, 2011 ( 2011-05-08 ) | 2.45 [ 24 ] |
| 4 | 5 | 5 | " The Wolf and the Lion " | Brian Kirk | David Benioff & D. B. Weiss | May 15, 2011 ( 2011-05-15 ) | 2.58 [ 25 ] |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 68 | 69 | 2 | " A Knight of the Seven Kingdoms " | David Nutter | Bryan Cogman | April 21, 2019 ( 2019-04-21 ) | 10.29 [ 89 ] |
| 69 | 70 | 3 | " The Long Night " | Miguel Sapochnik | David Benioff & D. B. Weiss | April 28, 2019 ( 2019-04-28 ) | 12.02 [ 90 ] |
| 70 | 71 | 4 | " The Last of the Starks " | David Nutter | David Benioff & D. B. Weiss | May 5, 2019 ( 2019-05-05 ) | 11.80 [ 91 ] |
| 71 | 72 | 5 | " The Bells " | Miguel Sapochnik | David Benioff & D. B. Weiss | May 12, 2019 ( 2019-05-12 ) | 12.48 [ 92 ] |
| 72 | 73 | 6 | " The Iron Throne " | David Benioff & D. B. Weiss | David Benioff & D. B. Weiss | May 19, 2019 ( 2019-05-19 ) | 13.61 [ 93 ] |
73 rows × 7 columns
Cleaning Data¶
After scraping the season-episode tables and creating a dataframe with the raw data, the new dataframe had many formatting issues. I manually added names to each column and removed quotation marks from the episode titles. Each air date cell contained the date twice (a written date and an ISO date in parentheses), so I removed the extra date and converted the column to datetime. I also removed the unnecessary Wikipedia reference markers from the viewership column and converted it to numeric.
The cleaned dataframe contains the columns:
- No_Overall: Overall episode number across the entire series.
- No_Series: Episode number within the season.
- Season: Season of the series (added in the next step).
- Episode: Episode title.
- Director(s): Director name(s) for the episode.
- Writer(s): Writer name(s) for the episode.
- Air_Date: Original U.S. air date (YYYY-MM-DD).
- US_Viewers: U.S. viewership (in Millions).
# Added names to columns
got_df.columns = ['No_Overall', 'No_Series', 'Episode', 'Director(s)', 'Writer(s)', 'Air_Date', 'US_Viewers']
# Set the index of the Pandas dataframe to the overall episode number in the series
got_df.set_index('No_Overall', inplace=True)
# Removes the surrounding quotation marks from episode names
got_df['Episode'] = got_df['Episode'].str.replace(r'^"|"$', '', regex = True).str.strip()
# Removes the extra date
got_df['Air_Date'] = got_df['Air_Date'].str.replace(r'\(.*?\)', '', regex = True).str.strip()
got_df['Air_Date'] = pd.to_datetime(got_df['Air_Date'], errors='coerce')
# Removes wikipedia reference markers and converts to numeric
got_df['US_Viewers'] = (
got_df['US_Viewers']
.astype(str)
.str.replace(r'\[.*?\]', '', regex = True)
.str.replace(',', '')
.str.strip()
)
got_df['US_Viewers'] = pd.to_numeric(got_df['US_Viewers'], errors='coerce')
got_df
| No_Overall | No_Series | Episode | Director(s) | Writer(s) | Air_Date | US_Viewers |
|---|---|---|---|---|---|---|
| 1 | 1 | Winter Is Coming | Tim Van Patten | David Benioff & D. B. Weiss | 2011-04-17 | 2.22 |
| 2 | 2 | The Kingsroad | Tim Van Patten | David Benioff & D. B. Weiss | 2011-04-24 | 2.20 |
| 3 | 3 | Lord Snow | Brian Kirk | David Benioff & D. B. Weiss | 2011-05-01 | 2.44 |
| 4 | 4 | Cripples, Bastards, and Broken Things | Brian Kirk | Bryan Cogman | 2011-05-08 | 2.45 |
| 5 | 5 | The Wolf and the Lion | Brian Kirk | David Benioff & D. B. Weiss | 2011-05-15 | 2.58 |
| ... | ... | ... | ... | ... | ... | ... |
| 69 | 2 | A Knight of the Seven Kingdoms | David Nutter | Bryan Cogman | 2019-04-21 | 10.29 |
| 70 | 3 | The Long Night | Miguel Sapochnik | David Benioff & D. B. Weiss | 2019-04-28 | 12.02 |
| 71 | 4 | The Last of the Starks | David Nutter | David Benioff & D. B. Weiss | 2019-05-05 | 11.80 |
| 72 | 5 | The Bells | Miguel Sapochnik | David Benioff & D. B. Weiss | 2019-05-12 | 12.48 |
| 73 | 6 | The Iron Throne | David Benioff & D. B. Weiss | David Benioff & D. B. Weiss | 2019-05-19 | 13.61 |
73 rows × 6 columns
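The same cleaning rules can be checked on a single raw row using only the standard library. This is a sketch, not the notebook's own method: the three raw strings are copied from the first row of the scraped table above, and the variable names are illustrative.

```python
import re
from datetime import datetime

# Raw cell values copied from the first scraped row
raw_title = '" Winter Is Coming "'
raw_date = 'April 17, 2011 ( 2011-04-17 )'
raw_viewers = '2.22 [ 21 ]'

# Strip the leading/trailing quotation mark, then the leftover spaces
title = re.sub(r'^"|"$', '', raw_title).strip()

# Drop the parenthesised ISO date, then parse the written date
date_text = re.sub(r'\(.*?\)', '', raw_date).strip()
air_date = datetime.strptime(date_text, '%B %d, %Y').date()

# Drop the reference marker and convert to a number
viewers = float(re.sub(r'\[.*?\]', '', raw_viewers).replace(',', '').strip())

print(title, air_date, viewers)
```

The non-greedy `.*?` matters here: it stops at the first closing bracket, so only the marker itself is removed.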
Adding a season column to the table¶
# Make a copy of original DataFrame
got = got_df.copy()
# Ensure index is an integer
got.index = got.index.astype(int)
# Creates a new column named "Season"
got['Season'] = None
# Assigns seasons based on the episode ranges
got.loc[1:10, 'Season'] = 1
got.loc[11:20, 'Season'] = 2
got.loc[21:30, 'Season'] = 3
got.loc[31:40, 'Season'] = 4
got.loc[41:50, 'Season'] = 5
got.loc[51:60, 'Season'] = 6
got.loc[61:67, 'Season'] = 7
got.loc[68:73, 'Season'] = 8
season_data = got.pop('Season')
# Inserts "Season" column into position 1
got.insert(1, 'Season', season_data)
got
| No_Overall | No_Series | Season | Episode | Director(s) | Writer(s) | Air_Date | US_Viewers |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | Winter Is Coming | Tim Van Patten | David Benioff & D. B. Weiss | 2011-04-17 | 2.22 |
| 2 | 2 | 1 | The Kingsroad | Tim Van Patten | David Benioff & D. B. Weiss | 2011-04-24 | 2.20 |
| 3 | 3 | 1 | Lord Snow | Brian Kirk | David Benioff & D. B. Weiss | 2011-05-01 | 2.44 |
| 4 | 4 | 1 | Cripples, Bastards, and Broken Things | Brian Kirk | Bryan Cogman | 2011-05-08 | 2.45 |
| 5 | 5 | 1 | The Wolf and the Lion | Brian Kirk | David Benioff & D. B. Weiss | 2011-05-15 | 2.58 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 69 | 2 | 8 | A Knight of the Seven Kingdoms | David Nutter | Bryan Cogman | 2019-04-21 | 10.29 |
| 70 | 3 | 8 | The Long Night | Miguel Sapochnik | David Benioff & D. B. Weiss | 2019-04-28 | 12.02 |
| 71 | 4 | 8 | The Last of the Starks | David Nutter | David Benioff & D. B. Weiss | 2019-05-05 | 11.80 |
| 72 | 5 | 8 | The Bells | Miguel Sapochnik | David Benioff & D. B. Weiss | 2019-05-12 | 12.48 |
| 73 | 6 | 8 | The Iron Throne | David Benioff & D. B. Weiss | David Benioff & D. B. Weiss | 2019-05-19 | 13.61 |
73 rows × 7 columns
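The episode ranges above are typed by hand. As an alternative sketch, the season labels could be built from the per-season episode counts in the Series Overview table; the list below is copied from that table, and the variable names are illustrative.

```python
# Episode counts per season, taken from the Series Overview table.
episodes_per_season = [10, 10, 10, 10, 10, 10, 7, 6]

season_labels = []
for season, n_episodes in enumerate(episodes_per_season, start=1):
    # Repeat the season number once per episode in that season
    season_labels.extend([season] * n_episodes)

# season_labels[i] is the season of overall episode i + 1
print(len(season_labels), season_labels[60], season_labels[67])
```

Assuming the dataframe is sorted by overall episode number, something like `got['Season'] = season_labels` would then assign all 73 labels at once, and the list length doubles as a sanity check on the episode total.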
Overview of what was done: After scraping the episode list page, I organized the raw information I had collected into a structured dataframe using Pandas. I then moved on to cleaning the data, which involved adjusting the column formats, removing data that wasn't part of the main show's episode list, and ensuring that episodes were properly assigned to their respective seasons. I addressed the string-cleaning issues and made sure that numerical values like U.S. viewership were converted to numeric types to allow for later statistical calculations and analysis.
Visualization of Data¶
- Graph 1: Game of Thrones Viewership by Episode - MatPlotLib
- Graph 2: Distribution of viewership per season - Seaborn
- Graph 3: Episode Count per Director - Plotly
- Graph 4: Top 10 Most-Watched Game of Thrones Episodes & Their Directors - Plotly
Graph 1: Game of Thrones Viewership by Episode - MatPlotLib¶
This line graph, built with MatPlotLib, shows the U.S. viewership for every episode of Game of Thrones in chronological order. A label above each point gives the viewership in millions, rounded to one decimal place.
# Size of figure
plt.figure(figsize = (14, 7))
plt.plot(got.index, got['US_Viewers'], marker = 'o', linewidth = 2, color='#8e44ad')
plt.title("Game of Thrones Viewership by Episode", fontsize = 14)
plt.xlabel("Episode Number", fontsize = 12)
plt.ylabel("U.S. Viewers (millions)", fontsize = 12)
# Add labels on points
for x, y in zip(got.index, got['US_Viewers']):
    plt.text(x, y+0.2, f"{y:.1f}", ha='center', fontsize=8)
plt.grid(alpha = 0.3)
plt.show()
This line graph shows how viewership changes over time. Following the episodes' chronological order, it reveals an overall increase in viewership along with audience dips and spikes from one episode to the next. There are several noticeable spikes in viewership during important episodes such as season premieres and finales.
Graph 2: Game of Thrones Viewership by Season - Seaborn¶
This is a Seaborn boxplot that visualizes the U.S. viewership for every season of Game of Thrones. It gives us a way to compare distributions across seasons. Our line graph showed changes over time, while this boxplot highlights the range of viewership within each season, which allows a clear comparison between the seasons.
sns.boxplot(x = 'Season',
y = 'US_Viewers',
hue = 'Season',
palette = 'mako',
data = got)
plt.title("Game of Thrones Viewership by Season", fontsize = 14)
plt.xlabel("Season", fontsize = 12)
plt.ylabel("U.S. Viewers (Millions)", fontsize = 12)
plt.grid(axis ='y', alpha = 0.3)
plt.show()
Seasons 1-6 show viewership steadily increasing. There is a big increase in viewership between Seasons 6 and 7. Seasons 7 and 8 have higher medians and wider spreads, with Season 8 showing the most variability, possibly because of the controversial and mixed reception of its episodes.
Graph 3: Episode Count per Director - Plotly¶
This bar graph summarizes how many episodes of Game of Thrones were directed by each director across all eight seasons. This visual allows us to see which directors were most involved throughout the runtime of the show.
To start, I created a copy of my "got" dataframe in order to separate the data I need.
# Copy of got dataframe
directors_df = got.copy()
# Replace "&" with "," in order to split directors
directors_df['Director(s)'] = directors_df['Director(s)'].str.replace("&", ",")
# Splitting into a list
directors_df['Director_List'] = directors_df['Director(s)'].str.split(",")
# Ungroup directors
directors = directors_df.explode('Director_List')
directors['Director_List'] = directors['Director_List'].str.strip()
# Count number of episodes per director
director_count = directors['Director_List'].value_counts().reset_index()
director_count.columns = ['Director', 'Episode_Count']
director_count
| | Director | Episode_Count |
|---|---|---|
| 0 | David Nutter | 9 |
| 1 | Alan Taylor | 7 |
| 2 | Alex Graves | 6 |
| 3 | Mark Mylod | 6 |
| 4 | Miguel Sapochnik | 6 |
| 5 | Jeremy Podeswa | 6 |
| 6 | Daniel Minahan | 5 |
| 7 | Michelle MacLaren | 4 |
| 8 | Alik Sakharov | 4 |
| 9 | Brian Kirk | 3 |
| 10 | Tim Van Patten | 2 |
| 11 | Neil Marshall | 2 |
| 12 | David Benioff | 2 |
| 13 | David Petrarca | 2 |
| 14 | Michael Slovis | 2 |
| 15 | D. B. Weiss | 2 |
| 16 | Daniel Sackheim | 2 |
| 17 | Jack Bender | 2 |
| 18 | Matt Shakman | 2 |
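The split-and-count step can also be sketched without Pandas. The sample below uses the director cells from the ten episodes visible in the tables above (episodes 1-5 and 69-73), so its counts cover only that subset, not the full series; the names `director_cells` and `counts` are illustrative.

```python
import re
from collections import Counter

# Director cells for the ten episodes shown in the tables above
director_cells = [
    "Tim Van Patten", "Tim Van Patten", "Brian Kirk", "Brian Kirk", "Brian Kirk",
    "David Nutter", "Miguel Sapochnik", "David Nutter", "Miguel Sapochnik",
    "David Benioff & D. B. Weiss",
]

counts = Counter()
for cell in director_cells:
    # Split on "&" or "," so co-directors are counted separately
    for name in re.split(r"\s*[&,]\s*", cell):
        counts[name.strip()] += 1

print(counts.most_common())
```

Because co-directed episodes are split, the per-director counts can add up to more than the number of episodes.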
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
fig = px.bar(
director_count.sort_values("Episode_Count", ascending=False),
x = "Director",
y = "Episode_Count",
title = "Game of Thrones Episode Count per Director",
labels={
"Director": "Director Name",
"Episode_Count": "Number of Episodes"},
color = "Episode_Count",
color_continuous_scale = "Agsunset"
)
fig.update_layout(
xaxis_tickangle = -45,
xaxis_title = "Name of Director",
yaxis_title = "Number of Episodes",
font = dict(size = 14),
title_x = 0.5
)
fig.show()
Throughout its run, the series had 19 different directors according to the table above (counting David Benioff and D. B. Weiss separately). David Nutter directed more episodes than any other director, with 9. Alan Taylor comes in second with 7 episodes, followed by Alex Graves, Mark Mylod, Miguel Sapochnik, and Jeremy Podeswa with 6 episodes each. After these few directors, the distribution begins to flatten, with the remaining directors having directed between 2 and 5 episodes.
Graph 4: Top 10 Most-Watched Game of Thrones Episodes & Their Directors - Plotly¶
This Plotly bar chart displays the top 10 most-watched episodes of Game of Thrones along with the directors who directed them. The bars represent U.S. viewership in millions, and hovering over a bar shows the director(s) and season for that episode.
# Make copy of dataframe and sort by top 10 viewership
top10 = got.sort_values(by='US_Viewers', ascending = False).head(10).copy()
# Ungroup directors and split into list
top10['Director(s)'] = top10['Director(s)'].str.replace('&', ',', regex = False)
top10['Director_List'] = top10['Director(s)'].str.split(',')
top10['Director_List'] = top10['Director_List'].apply(lambda lst: [d.strip() for d in lst])
fig = px.bar(top10,
x = 'US_Viewers',
y = 'Episode',
orientation = 'h',
color='US_Viewers',
color_continuous_scale='Blues',
hover_data = ['Director(s)', 'Season'],
title = "Top 10 Most-Watched Game of Thrones Episodes & Their Directors",
labels={'US_Viewers': 'U.S. Viewers (millions)',
'Episode': 'Episode Title'
}
)
# Highest ranked at the top
fig.update_layout(yaxis=dict(autorange="reversed"))
fig.show()
Many of the highest-viewed episodes occur in Seasons 7 and 8, showing us that the show reached the height of its popularity near its conclusion. Many of the top-ranked episodes were directed by loved and well-known Game of Thrones directors such as Miguel Sapochnik and David Nutter, both of whom are associated with some of the best episodes in the entire series, having directed iconic episodes like "Battle of the Bastards", "Hardhome" and "The Rains of Castamere". We can see how high-stakes episodes and finales are highlighted here.
Statistical Analysis¶
Descriptive Analysis¶
This descriptive analysis serves as a summary for the distribution of U.S. episode viewership. I calculated several key descriptive statistics individually, including the mean, median, standard deviation, variance, minimum, and maximum. These values were then combined into a custom summary table for easier comparison.
# Found the mean, median, std, var, min and max viewership
mean_viewers = got['US_Viewers'].mean()
median_viewers = got['US_Viewers'].median()
std_viewers = got['US_Viewers'].std()
var_viewers = got['US_Viewers'].var()
min_viewers = got['US_Viewers'].min()
max_viewers = got['US_Viewers'].max()
# Storing values as data
statistics = {
"Mean": [mean_viewers],
"Median": [median_viewers],
"Standard Deviation": [std_viewers],
"Variance": [var_viewers],
"Minimum": [min_viewers],
"Maximum": [max_viewers]
}
# Creating a dataframe with stored data
statistical_df = pd.DataFrame(statistics, index=["US Viewership"])
statistical_df
| | Mean | Median | Standard Deviation | Variance | Minimum | Maximum |
|---|---|---|---|---|---|---|
| US Viewership | 6.447808 | 6.64 | 2.827372 | 7.994031 | 2.2 | 13.61 |
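Pandas `.std()` and `.var()` default to the sample versions (dividing by n - 1), which is why the variance in the table is the square of the standard deviation. A small standard-library sketch of the same statistics, using only the first five episodes' viewership from the tables above rather than the full series:

```python
import statistics

# Viewership (millions) for episodes 1-5, copied from the cleaned table
first_five = [2.22, 2.20, 2.44, 2.45, 2.58]

mean_v = statistics.mean(first_five)
median_v = statistics.median(first_five)
std_v = statistics.stdev(first_five)      # sample standard deviation (n - 1)
var_v = statistics.variance(first_five)   # sample variance (n - 1)

print(mean_v, median_v, std_v, var_v)
```

These numbers describe only the five sampled episodes, so they will not match the series-wide table.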
F Test in ANOVA¶
I am going to use this test to determine whether average viewership differs across the 8 seasons of Game of Thrones. This test compares the variation between seasons to the variation within seasons.
F-statistic: The ratio of between-season variance to within-season variance. The higher the F value, the greater the difference between season means relative to the spread within each season.
p-value: The p-value tells us whether the difference is statistically significant. If p < 0.05, we reject the null hypothesis that all seasons have the same mean viewership.
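To make the F-statistic concrete, here is a hand computation on toy numbers. The three groups are made up for illustration and are not the real viewership data; the calculation follows the standard one-way ANOVA definition (between-group mean square divided by within-group mean square).

```python
# Toy viewership numbers for three made-up "seasons" (not the real data).
toy_groups = [[2.0, 3.0, 4.0], [5.0, 6.0, 7.0], [9.0, 10.0, 11.0]]

all_values = [v for g in toy_groups for v in g]
grand_mean = sum(all_values) / len(all_values)

# Variation of the group means around the grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in toy_groups)
# Variation of each value around its own group mean
ss_within = sum((v - sum(g) / len(g)) ** 2 for g in toy_groups for v in g)

df_between = len(toy_groups) - 1
df_within = len(all_values) - len(toy_groups)

f_value = (ss_between / df_between) / (ss_within / df_within)
print(round(f_value, 2))
```

The group means sit far apart while the values inside each group sit close together, so the ratio comes out large.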
# Creates a list of US_Viewers for each season
groups = [got[got['Season'] == s]['US_Viewers'] for s in got['Season'].unique()]
# Run the ANOVA test
f_stat, p_value = f_oneway(*groups)
f_stat, p_value
(np.float64(219.0004591550038), np.float64(1.1873296940055504e-42))
Our F-statistic is very large (about 219). This means that the differences in viewership between seasons are far larger than the differences within seasons. Our p-value (about 1.2e-42) is far below our 0.05 threshold, which means that there is a statistically significant difference in average viewership between seasons.
These statistical results support the trends we saw in our earlier visualizations: early seasons (1-3) had lower but steadily increasing viewership, while Seasons 6-8 saw dramatic increases.
Conclusion¶
So, after looking at the numbers... I am surprised. As a fan of the show, I honestly thought there would've been a large number of people dropping the show after Season 7 due to the highly controversial way the show was being managed, but the data tells a different story. Viewership actually rose going into Season 8, by a lot. Not only that, but some of the most-watched episodes in the entire show came from Season 8. This also brings to light something I want to highlight: high viewership does NOT equal happy viewers. I believe that many viewers (such as myself) stuck around solely to see how the directors would end the series.
The descriptive statistics demonstrated a wide range of viewership values, with large jumps in later seasons. Our visualizations also supported this trend. Our line graph revealed upward growth across episodes, our boxplot showed an increasing median viewership with later seasons having wider spreads, and our director-focused plots highlighted how certain episodes, particularly high-stakes episodes and finales, drew the highest viewership.
Overall, this analysis demonstrated how data can reveal many patterns and how it can even challenge our expectations.
https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes